| Health Outcome | Mean | Standard Error |
|---|---|---|
| Diabetes Prevalence | 11.2% | 3.3% |
| Lower Estimate | 10.5% | 3.2% |
| Upper Estimate | 11.9% | 3.5% |

| County Demographic | Mean | Standard Error |
|---|---|---|
| Population | 4242 | 1581 |

| Distance | Mean | Standard Error |
|---|---|---|
| Haversine | 1.808 | 1.853 |
| Route-Based | 2.572 | 2.406 |
| Difference | 0.764 | 0.700 |
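The Haversine distance in the table above is the great-circle ("as the crow flies") distance between two coordinates, which is why it tends to be shorter than the route-based distance. The following is a minimal Python sketch of the standard Haversine formula, not the code used for this analysis. The function name `haversine_distance` and the default radius in miles are assumptions made for illustration; the source does not state the units of its distances.

```python
import math

# Mean Earth radius in miles. The unit is an assumption for illustration;
# pass a different radius (e.g. 6371.0 for kilometers) to change units.
EARTH_RADIUS_MILES = 3958.8


def haversine_distance(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_MILES):
    """Great-circle distance between two (latitude, longitude) points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))
```

A route-based distance requires a road network and a routing engine, so it cannot be reduced to a closed-form expression like this one.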
Can we use a function of distance to healthy food retailers to quantify food access in Forsyth County, North Carolina, even if this function is subject to misclassification?
Can we estimate the relationship between low food access and diabetes prevalence?
Functions of sensitivity and specificity can be used to estimate prevalence of an imperfectly classified outcome (Speybroeck et al., 2013).
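As an illustration of how sensitivity and specificity enter a prevalence estimate, the sketch below implements the standard frequentist correction commonly called the Rogan-Gladen estimator. It is not the Bayesian approach discussed by Speybroeck et al., and the function name `corrected_prevalence` is hypothetical. It assumes the sensitivity and specificity are known and that their sum exceeds 1.

```python
def corrected_prevalence(apparent_prevalence, sensitivity, specificity):
    """Rogan-Gladen correction of an apparent (misclassified) prevalence.

    The apparent prevalence satisfies
        p_apparent = Se * p_true + (1 - Sp) * (1 - p_true),
    which is solved here for p_true.
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    estimate = (apparent_prevalence + specificity - 1.0) / denominator
    # Sampling noise can push the raw estimate outside [0, 1], so truncate.
    return min(1.0, max(0.0, estimate))


# Usage: a true prevalence of 0.20 observed through a classifier with
# Se = 0.9 and Sp = 0.8 gives an apparent prevalence of 0.9*0.2 + 0.2*0.8.
apparent = 0.9 * 0.20 + (1 - 0.8) * (1 - 0.20)
recovered = corrected_prevalence(apparent, sensitivity=0.9, specificity=0.8)
```

In this usage example the correction recovers the true prevalence because the apparent prevalence was generated without sampling error.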
Dichotomizing a continuous variable that is measured with non-differential error can still produce a binary variable with differential error (Flegal et al., 1991).
Validation data can be used to build a model for \(X\), which is then used for multiple imputation (Cole et al., 2006).
Naive regression models, where the error-prone \(X^*\) is used in place of \(X\), will have regression coefficients that are biased by a function of the sensitivity and specificity (Shaw et al., 2020).
We follow Tang et al. (2015) and others by considering a maximum likelihood estimator (MLE) that incorporates both queried and unqueried observations.
We compare the gold standard approach to the following approaches:
The coefficient we are looking to estimate has true value \(\beta_1 = 2\) throughout the simulation study.
We are interested in finding the coefficient vector \(\boldsymbol{\beta}\) from the Poisson model of \(Y \mid X, \boldsymbol Z\).
The log-likelihood function is constructed based on the available data, and we build up from the ideal to reality through the following three cases.
\[P(Y,X,\boldsymbol{Z}) = P_{\boldsymbol{\beta}}(Y \mid X, \boldsymbol{Z})P(X \mid \boldsymbol{Z})P(\boldsymbol{Z})\]
\[P_{\boldsymbol{\beta}}(Y,X,\boldsymbol{Z}, X^*) \propto P_{\boldsymbol{\beta}}(Y \mid X, \boldsymbol{Z})P(X \mid X^*, \boldsymbol{Z}).\]
We first handle the \(n\) queried observations.
\[\begin{aligned}
P(Y,X,\boldsymbol{Z}, X^*) &= P_{\boldsymbol{\beta}}(Y \mid X, X^*, \boldsymbol{Z})P_{\boldsymbol{\eta}}(X \mid X^*, \boldsymbol{Z})P(X^*, \boldsymbol{Z}) \\
&= P_{\boldsymbol{\beta}}(Y \mid X, \boldsymbol{Z})P_{\boldsymbol{\eta}}(X \mid X^*, \boldsymbol{Z})P(X^*, \boldsymbol{Z})
\end{aligned}\]
The second equality assumes that \(X^*\) carries no additional information about \(Y\) once \(X\) and \(\boldsymbol{Z}\) are known.
We now handle the other \(N-n\) observations by marginalizing over \(X\).
\[P(Y,X^*,\boldsymbol{Z}) = \sum\limits_{x=0}^{1}P(Y,X=x,\boldsymbol{Z}, X^*)\]
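Because \(X\) is binary, the sum has only two terms. Substituting the factorization used for the queried observations gives the following expansion, written out here as a worked step:

\[P(Y,X^*,\boldsymbol{Z}) = \left[P_{\boldsymbol{\beta}}(Y \mid X=0, \boldsymbol{Z})P_{\boldsymbol{\eta}}(X=0 \mid X^*, \boldsymbol{Z}) + P_{\boldsymbol{\beta}}(Y \mid X=1, \boldsymbol{Z})P_{\boldsymbol{\eta}}(X=1 \mid X^*, \boldsymbol{Z})\right]P(X^*, \boldsymbol{Z})\]

The factor \(P(X^*, \boldsymbol{Z})\) does not involve \(\boldsymbol{\beta}\) or \(\boldsymbol{\eta}\), so it can be dropped when maximizing.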
In the full log-likelihood, each observation’s contribution is additive, so we can now use all \(N\) observations along with \(Q_i\), an indicator for the query status of observation \(i\).
We are ready to maximize the log-likelihood below using numerical methods in R to solve for \(\hat{\boldsymbol{\beta}}\) and better understand the relationship of interest.
\[\ell(\boldsymbol{\beta}, \boldsymbol{\eta}) = \sum\limits_{i = 1}^N Q_i\log P_{\boldsymbol{\beta},\boldsymbol{\eta}}(Y_i, X_i, \boldsymbol{Z}_i, X_i^*) + \sum\limits_{i = 1}^N (1 - Q_i)\log P_{\boldsymbol{\beta},\boldsymbol{\eta}}(Y_i, X_i^*, \boldsymbol{Z}_i).\]
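To make the structure of this log-likelihood concrete, here is a minimal sketch of how it might be evaluated. It is written in Python for illustration, although the analysis itself uses R. Several pieces are assumptions not specified in the source: a single scalar covariate \(Z\), a log-linear Poisson mean with no offset, a logistic model for \(X \mid X^*, Z\), and all function names. The term \(P(X^*, \boldsymbol{Z})\) is omitted because it does not depend on the parameters.

```python
import math


def poisson_log_pmf(y, mean):
    """Log of the Poisson probability mass function."""
    return y * math.log(mean) - mean - math.lgamma(y + 1)


def log_outcome(y, x, z, beta):
    """log P_beta(Y | X, Z), assuming log E[Y] = beta0 + beta1*X + beta2*Z."""
    return poisson_log_pmf(y, math.exp(beta[0] + beta[1] * x + beta[2] * z))


def log_exposure(x, xstar, z, eta):
    """log P_eta(X | X*, Z), assuming a logistic model (hypothetical choice)."""
    p1 = 1.0 / (1.0 + math.exp(-(eta[0] + eta[1] * xstar + eta[2] * z)))
    return math.log(p1 if x == 1 else 1.0 - p1)


def log_likelihood(beta, eta, data):
    """Sum the contributions of queried (q = 1) and unqueried (q = 0) rows.

    Each row is (y, x, xstar, z, q); x is ignored (may be None) when q == 0.
    """
    total = 0.0
    for y, x, xstar, z, q in data:
        if q == 1:
            # Queried: the true X is observed and used directly.
            total += log_outcome(y, x, z, beta) + log_exposure(x, xstar, z, eta)
        else:
            # Unqueried: marginalize over X in {0, 1} with a log-sum-exp.
            terms = [log_outcome(y, k, z, beta) + log_exposure(k, xstar, z, eta)
                     for k in (0, 1)]
            top = max(terms)
            total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total
```

The negative of `log_likelihood` could then be handed to a general-purpose numerical optimizer to obtain estimates of \(\boldsymbol{\beta}\) and \(\boldsymbol{\eta}\) jointly.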
S. R. Cole, H. Chu, and S. Greenland. Multiple-imputation for measurement-error correction. International Journal of Epidemiology, 35(4):1074–1081, 2006.
K. M. Flegal, P. M. Keyl, and F. J. Nieto. Differential misclassification arising from nondifferential errors in exposure measurement. American Journal of Epidemiology, 134(10):1233–1246, 1991.
E. Gucciardi, M. Vahabi, N. Norris, J. P. Del Monte, and C. Farnum. The intersection between food insecurity and diabetes: a review. Current Nutrition Reports, 3:324–332, 2014.
World Health Organization. Healthy diet, 2019. URL https://iris.who.int/handle/10665/325828.
P. A. Shaw, R. H. Keogh, et al. STRATOS guidance document on measurement error and misclassification of variables in observational epidemiology: part 2—more complex methods of adjustment and advanced topics. Statistics in Medicine, 39(16):2232–2263, 2020.
B. E. Shepherd, P. A. Shaw, and L. E. Dodd. Using audit information to adjust parameter estimates for data errors in clinical trials. Clinical Trials, 9(6):721–729, 2012.
N. Speybroeck, B. Devleesschauwer, L. Joseph, and D. Berkvens. Misclassification errors in prevalence estimation: Bayesian handling with care. International Journal of Public Health, 58:791–795, 2013.
L. Tang, R. H. Lyles, C. C. King, D. D. Celentano, and Y. Lo. Binary regression with differentially misclassified response and exposure variables. Statistics in Medicine, 34(9):1605–1620, 2015.
STA791: Fall 2023